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ABSTRACT 

During the origin of life, the biological information of 
nucleic acid polymers must have increased to 
encode functional molecules (the RNA world). 
Ribozymes tend to be compositionally unbiased, 
as is the vast majority of possible sequence space. 
However, ribonucleotides vary greatly in synthetic 
yield, reactivity and degradation rate, and their 
non-enzymatic polymerization results in compos- 
itionally biased sequences. While natural selection 
could lead to complex sequences, molecules with 
some activity are required to begin this process. 
Was the emergence of compositionally diverse se- 
quences a matter of chance, or could prebiotically 
plausible reactions counter chemical biases to 
increase the probability of finding a ribozyme? Our 
in silico simulations using a two-letter alphabet 
show that template-directed ligation and high con- 
catenation rates counter compositional bias and 
shift the pool toward longer sequences, permitting 
greater exploration of sequence space and stable 
folding. We verified experimentally that unbiased 
DNA sequences are more efficient templates for 
ligation, thus increasing the compositional diversity 
of the pool. Our work suggests that prebiotically 
plausible chemical mechanisms of nucleic acid 
polymerization and ligation could predispose 
toward a diverse pool of longer, potentially 
structured molecules. Such mechanisms could 
have set the stage for the appearance of functional 
activity very early in the emergence of life. 



INTRODUCTION 

The biology of modern organisms is based on RNA, DNA 
and proteins, but this biochemistry was probably preceded 
by a stage in which RNA molecules acted both as chemical 



catalysts and carriers of genetic information. Evidence 
for this early stage of life (the 'RNA world') includes the 
similarity of ancient chemical cofactors to certain 
ribonucleotides and the discovery that the catalytic core 
of the ribosome is composed of RNA (1-5). Possible 
pathways for the prebiotically plausible synthesis of the 
components of RNA and the polymerization of 
ribonucleotides have been reported by several groups 
(3,6-8). While natural selection could enhance low cata- 
lytic activity, the very earliest ribozymes must have arisen 
through chemical processes (3). Understanding the details 
of this initial emergence is a deep conceptual puzzle (9). 

The first ribozymes must have emerged from pools of 
short sequences that were low in diversity and information 
content, but it is unclear how the complexity of these 
pools could be increased (10,11). These sequence pools 
would have been limited for at least two reasons. First, 
monomers would have different abundances because they 
are synthesized and degraded by different pathways. For 
example, a concentrated eutectic phase solution of 
ammonium cyanide yields significantly more adenine 
than guanine, uracil and cytosine (roughly lOx or more) 
(6). Degradation affects the nucleobases differently, with 
cytosine being particularly susceptible to spontaneous de- 
amination (12). Indeed, the abundances of nucleobases 
detected in meteorites also vary by one or more orders 
of magnitude (13-16). Second, the rate at which different 
monomers are polymerized can vary by an order of mag- 
nitude (6,17-20). In one study of montmorillonite- 
catalyzed RNA synthesis, this bias led to a large reduction 
in diversity, as only 3 out of 32 possible pentamer 
sequences were formed in detectable amounts from a 
mixture of activated A and C monomers (21). These 
biases reduce the diversity of the sequences generated, re- 
stricting exploration of sequence space and thus reducing 
the probability of generating a sequence with biological 
function. While compositional biases might increase the 
probability of generating functional RNA (22-24), the 
magnitude of the biases associated with prebiotically 
plausible polymerization is still substantially larger than 
potentially favorable biases. Interestingly, ribozymes with 
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limited compositional diversity have been made by a com- 
bination of rational design and in vitro evolution on re- 
stricted alphabets, but the resulting ribozymes exhibited 
decreased catalytic efficiency (the three-letter alphabet 
gave a 2500-fold decrease in k cat relative to the four-letter 
alphabet; the two-letter alphabet gave a further 10-fold 
decrease in k cat as well as a large decrease in total 
product conversion due to ribozyme misfolding) (25,26). 
While fine-tuning the conditions — e.g. by adjusting 
monomer ratios to counteract reduced reactivity or 
limiting UV irradiation to attain an appropriate 
monomer ratio (7,27,28) — could potentially overcome 
these biases, such conditions would be unlikely early on. 
Therefore, we sought more general mechanisms to counter 
compositional bias in nucleic acid pools undergoing 
prebiotically plausible reactions. Our experiments using 
DNA and simulations of binary sequences demonstrate 
that template-directed ligation is one such mechanism. 
Our RNA folding simulations suggest that composition- 
ally diverse sequences are more likely to fold into stable 
structures compared with the substantially biased se- 
quences that would be derived from template-independent 
processes. Greater stability is one of the factors promoting 
greater functional activity in RNA aptamers (29,30). In 
addition, our simulations indicate that template-directed 
ligation would shift the pool toward longer sequences, 
another important factor for activity (31,32). Our results 
suggest that a broad exploration of interesting sequence 
space was possible in prebiotic sequence pools despite 
initial chemical biases. 



MATERIALS AND METHODS 

Simulations 

Two types of simulation were performed (stochastic and 
deterministic). Both simulations are limited for computa- 
tional reasons. The stochastic simulation keeps track of 
each monomer or oligomer and implements a specific 
reaction at each time step. In practice, the simulation 
cannot keep track of an infinite number of reactants and 
possible reactions, so the system is limited by the total 
number of monomers considered. This mimics a protocell 
containing a relatively small number of monomers 
(e.g. 400, in which case the longest possible sequence 
would be 400 monomers). The stochastic simulation can 
generate sequences of ribozyme length. In contrast, the 
deterministic simulation keeps track of all possible 
species and reactions among them simultaneously, 
mimicking a very large, well-mixed system. In practice, 
the deterministic simulation cannot keep track of an 
infinite number of species, so the system must be truncated 
at a certain maximum length (e.g. 12). 

The stochastic simulation was based on the Gillespie 
algorithm (33). Each simulation began with a pool of 
monomers. The total number of monomers in the 
system was typically limited to 400. During each iteration, 
an exponential waiting time was generated before a single 
reaction (concatenation, template-directed ligation or hy- 
drolysis) occurred according to the relative rates of all 
possible reactions. Concatenation could occur between 



any two monomers or polymers. Realistic bias was 
introduced in the simulation as either a concatenation 
rate that depended on the identity of the 5' monomer 
(e.g. reactivity ratio of 19:1), or as a difference in the 
initial number of monomers of each type (e.g. abundance 
ratio of 9:1). Template-directed ligation was possible if a 
6-mer segment of one sequence (the template) was com- 
plementary to the trimer at the 3'-end of one substrate and 
the trimer at the 5'-end of the second substrate. Any 
sequence of length 6 or greater was a potential template; 
any sequence of length 3 or greater was a potential sub- 
strate. Circularization reactions were not considered. 
Hydrolysis of phosphodiester bonds occurred at a 
constant rate per bond. Characteristics of the system 
(average length and Q) were measured at exponentially 
increasing time steps (i.e. 0, 1,2, 4, 8, etc), and steady state 
was considered to be achieved when the characteristics at 
consecutive time steps deviated by <0.1%. SEs for these 
measurements were calculated from multiple simulation 
runs. The deterministic simulation used the same reactions 
as the stochastic simulation and kept track of the abun- 
dances of all possible sequences as the system evolved. The 
system was truncated at a maximum polymer length (typ- 
ically 12) for computational tractability (i.e. polymers of 
the maximum length could not undergo further concaten- 
ation or ligation). Average Q, average length and diver- 
sity were computed using steady-state abundances. The 
simulations were performed on the Odyssey Cluster of 
the FAS Research Computing Group at Harvard 
University. See Supplementary Data for simulation 
details. 

Experiment: degenerate oligonucleotides for 
template-directed ligation of a heterogeneous sequence 
pool of DNA (four bases) 

In heterogeneous pool reactions, all oligonucleotides were 
composed of A,C,G,T. Degenerate DNA oligonucleotides 
were obtained from Keck Oligo Synthesis Resource (Yale 
University, New Haven, CT, USA). Octamers and tem- 
plates (40-mer) were synthesized as 5'-NNNN. . . using the 
facility's standard procedure for equimolar, degenerate 
oligonucleotides. Oligos were purified by reverse-phase 
cartridge. Octamers were phosphorylated as noted 
below, using non-radiolabeled ATP. 

Experiment: template-directed chemical ligation reactions 
(four bases) 

Reagents were purchased from Sigma-Aldrich (St Louis, 
MO, USA) unless otherwise specified. Synthetic degen- 
erate oligonucleotides (templates of length 40; 
5'-phosphorylated substrates of length 8) were mixed 
and ligated using cyanogen bromide following a previ- 
ously published procedure (34). Reactions contained 
1-2 uM DNA template and 16 uM DNA octamers. 
DNA sequences were mixed with buffer [0.23 M 2-(N- 
morpholino)ethanesulfonic acid, pH 7.4] and 19mM 
MgCl 2 , in 4.5 ul of aqueous solution, heated to 95° for 
3min and annealed by cooling on the benchtop for 
15min. The solution was placed on ice for 5min and 
0.5 ul CNBr (5 M in acetonitrile) was added (final 
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0.5 M). After 1 min, the reactions were quenched by 100 ul 
ethanol, ethanol precipitated for >lh at —20° and 
centrifuged at 12 000g at 4° for 30 min. Supernatant was 
removed and the pellets were dried on the benchtop over- 
night. The pellets were resuspended in 50 ul of water and 
aliquots were mixed with loading buffer for electrophor- 
esis through a 24% urea-polyacrylamide gel following a 
standard protocol. 

Experiment: next-generation sequencing of templates and 
products of template-directed ligation (four bases) 

Ligation products of length 16 were purified by urea- 
PAGE. Products were dephosphorylated with alkaline 
phosphatase following the manufacturer's protocol (all 
enzymes were obtained from New England Biolabs, 
Ipswich, MA, USA, unless otherwise specified), phenol- 
chloroform extracted and ethanol precipitated. Products, 
templates and octamers were resuspended and ligated to a 
barcoded 3' adapter sequence for Illumina sequencing 
using T4 RNA ligase. The barcoded 3' adapter sequences 
were (lowercase = RNA): 

TAG1 : 5'-ucgTGTCGTATGCCGTCTTCTGCTTGTddC, 
TAG2: 5'-ucgCATCGTATGCCGTCTTCTGCTTGTddC, 
3-TAG1 : 5' -ucgT ACTCGT ATGCCGTCTTCTGCTTGT 
ddC, 

3-TAG2: 5'-ucgACATCGTATGCCGTCTTCTGCTTG 
TddC, 

3-TAG3: 5'-ucgCTATCGTATGCCGTCTTCTGCTTGT 
ddC. 

Products were gel purified by urea-PAGE and 
phosphorylated by PNK following the manufacturer's 
protocol, phenol-chloroform extracted and ethanol 
precipitated. Phosphorylated products were then ligated 
to the 5' adapter sequence for Illumina sequencing using 
T4 RNA ligase. The 5' adapter sequence was: 5'-AATGA 
TACGGCGACCACCGACAGGTTCAGAGTTCTA 
Caguccgacgauc. Products were gel purified by urea- 
PAGE and reverse transcribed with Superscript III RT 
(Invitrogen, Carlsbad, CA, USA). cDNA products were 
gel purified on urea-PAGE and directly sequenced on an 
Illumina (Solexa) Genome Analyzer II. Polymerase chain 
reaction was not performed in order to avoid bias due to 
differential PCR amplification. As controls, template and 
substrate DNA were also prepared for sequencing follow- 
ing the same procedure. 

Experiment: sequence analysis (four bases) 

Sequence reads of the appropriate length containing the 
appropriately barcoded 3' adapter sequence were ex- 
tracted. The compositional diversity of each sequence 
was calculated for subsequence length k = 3. To eliminate 
effects due to the total length of the sequences, length 16 
was chosen for analysis. Products were already of this 
length. For templates (length 40), C 3 was calculated for 
all 16-mers contained in the template and the average was 
used as the C 3 of the template. For octamers (length 8), 
simulated 16-mer 'products' were generated in silico by 
randomly joining octamer sequences obtained 



experimentally, and the C3 of these simulated random 
products was calculated. 

Experiment: template-directed ligation with a single 
template (two bases) 

Each reaction contained 1-2 uM DNA template and 
16uM random radiolabeled DNA octamers (Eurofins 
MWG Operon, Huntsville, AL, USA and Sigma- 
Aldrich, St Louis, MO, USA). In single-template reac- 
tions, the template oligonucleotides were composed of C 
and T while the octamers were composed of A and G. The 
templates used in the reactions of Figure 3b were: 

5'_TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT, 

5'-CTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCT, 

5'-TCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCT, 

5'-CCTCTTCTCCTCTTCTCCTCTTCTCCTCTTCT, 

5'-CTCTTCTTTTCTCCTCTTCTTTTCTCCTCCCC. 

This set of templates utilizes a relatively small subset 
(25) of the theoretically possible octamers. These 25 
octamers were mixed in equimolar ratio, phosphorylated 
and radiolabeled by T4 polynucleotide kinase (New 
England Biolabs, Ipswich, MA, USA) and added to 
the reaction, which was performed as described earlier. 
Band intensities were measured using a Typhoon TRIO 
Variable Mode Imager (Piscataway, NJ, USA). See 
Supplementary Data for more details. 

DEFINITIONS AND MODEL 

Measuring compositional bias for a given sequence 

To measure the compositional bias, we use a definition 
inspired by grammar complexity and the Shannon 
entropy (35-37). For a sequence s of length L, we define 
H k (s) by 

Hk{s) = -^^logzfe), 

i 

where i is the index of a unique string of length k (k < L\ 
i = 1 to 2 for a binary string) and p t is the frequency of 
the rth /c-mer within s. If k = 1, H k is just the Shannon 
entropy. The H k for various k describe the compositional 
bias of a sequence, and the H k for k > 1 reveal information 
that Hi by itself does not. Consider this hypothetical 
scenario: '0' tends to follow T and vice versa, such that 
most sequences are alternating (e.g. '01010101. . .'). In this 
case, Hi would be high even though the sequence is not 
very diverse. However, H 2 is low, reflecting the lack of 
diversity on the '2-mer' scale, and so on for higher k. 

The H k have the advantage of being straightforward to 
compute and relevant for RNA folding, and measuring 
the diversity of potential template subsequences (see 
below). Like any simple measure, however, they suffer 
from some drawbacks. First, H k for a fixed k can 
suggest high internal sequence diversity when the 
sequence is actually highly regular. Some well-known 
measures of complexity [e.g. the number of states in the 
smallest finite state machine accepting the sequence (38), 
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Table 1. Example RNA sequences, compositional diversity and folding energy predicted by 
Viennafold (41) 



Sequence (50-mer) 


c 4 


E m 
(kcal/mol) 


GAUUGCCUAUAAACCAUUUAAUAGCCCAGCAACAUGCACAAAUGGGCGUU 


1.0 


-9.1 


CAAAAAAAAAACAAUAAUAAAUGGACAAUAACCCCAUUAAAAGAAGAUAC 


0.84 


-3.0 


AUAAAAAAAAAAGAAAAAUUACAUAAAAAAAAGAAAUAAAACAAAAUAAC 


0.65 


0 


AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 


0 


0 


AUGGUAAGUUCCCAAGGCGGGUUGGAAGAGAUAUCAUAGGAGCUUGUCGU 
(GTP aptamer 10-59) 


1.0 


-11.9 


AUAUAUAUAUAUAUAUAUAUAUAUAUAUADAUAUADAUAUADAUAUAUAU 


0.18 


-20.6 



In rows 2 and 3, adenine is 5- or 10-fold more abundant than the other bases, respectively, represent- 
ing a realistic bias in composition (6). The fifth row contains the sequence of a known aptamer (42); 
other sequences were computer-generated. The sixth row illustrates an unusual sequence with low C4 
and low folding energy. 



or algorithmic complexity, the length of the shortest 
program generating the sequence (39,40)] do not suffer 
from this limitation, but they are intractable to compute 
and do not incorporate information about the scale at 
which the underlying chemical processes are operating. 
Second, the calculation of H k for k > 1 involves non- 
independent, overlapping sequences. This would be ap- 
propriate for measuring the diversity of potential 
ligation sites. However, one might imagine a scenario 
with much competition for binding to the same template 
(high concentrations of similar sequences); in that case, 
overlapping sites might interfere with one another and 
perhaps an alternative measure of diversity would be 
more appropriate. Nevertheless, we find H k to have heur- 
istic value for the analysis described below. 

The maximum possible H k (Hf ax ) grows with total 
sequence length L. To focus on composition rather than 
length, we define the relative compositional diversity C k of 
a sequence to be the ratio H k (s)IHf i>L {L). In addition, we 
only compare C k for sequences of the same length. C k (s) 
measures the heterogeneity of strings of length k within 
sequence s. C k (s) is zero when all subsequences are iden- 
tical, and C k (s) is one when all possible subsequences 
appear equally often in s (Table 1). For our purpose, if 
k is chosen to be too small, the compositional bias of 
longer £>mers is not captured, but if k is too large, C k 
cannot distinguish well among different sequences. We 
generally used k = 3 except when analyzing long se- 
quences (50-mer), when we used k = 4. 

We find C k to be a useful measure of the diversity of 
k-mers within sequence s, because values of C k close to 1 
would be desirable in the RNA world for at least two 
reasons. First, high C k characterizes the vast majority of 
sequence space, because the number of different sequences 
corresponding to a particular composition is greater if the 
composition is more uniform (43). The total number of 
possible unique sequences varies approximately exponen- 
tially with C k (Figure la). Biases in monomer composition 



and reactivity would decrease the average C k and thus 
restrict the exploration of sequence space. For example, 
a 10-fold bias in composition decreases the average C k 
from 0.94 to 0.43, which represents a severe restriction 
in sequence space given the exponential dependence 
(Supplementary Data). While sequence space might 
contain many potentially structured molecules (44), any 
search through sequence space for which the average C k 
is low would under-represent or omit a large fraction of 
possible sequences. Therefore, high average C k is desirable 
for finding rare, functional molecules. 

Second, C k appears to be correlated with RNA folding 
energy. Intuitively, an internally diverse sequence would 
present more independent opportunities for finding com- 
plementary regions compared to a repetitive sequence of 
the same length. A simple toy model for hairpin formation 
verifies this effect (Supplementary Data). Known ribo- 
zymes have nearly maximal C 4 (average C 4 of ribozyme 
sequence families of length 40-60 is 0.970 ± 0.013; 
Figure lb, Supplementary Data) (45). On the other 
hand, some repetitive sequences have both low C k and 
low folding energy [e.g. (AU)„; Table 1]. It is conceivable 
that high C k in ribozymes could be merely a reflection of 
the fact that many were isolated from pools of nearly 
random sequences, which have high average C k . To under- 
stand the relationship between C k and folding, we 
computed the minimum folding energy (E m ) for a large 
number of random RNA sequences (Figure lc and 
Table 1). The fact that most random sequences have 
high C k would cause a spurious correlation between C k 
and E m simply because a greater range of E m is sampled by 
more sequences at high C k . To determine the correlation 
between C k and E m independently of this effect, we binned 
the RNA sequences according to C 4 and analyzed an 
equal number of unique sequences in five bins centered 
between 0.6 and 1. We also limited our analysis to se- 
quences with GC content of 40-60%, to avoid potential 
effects due to GC content alone. The folding energies are 
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Figure 1. Compositional diversity, sequence space and predicted RNA 
folding energy, (a) Most of sequence space is of high compositional 
diversity. Histogram of C4 for RNA sequences, computed from random 
sampling of 10 9 sequences of length 50 (black dots) in sihco. The 
complete histogram for all possible sequences of shorter length is com- 
putable and is similar to that of the random sample of 50-mers (length 
10 = blue, 12 = pink, 14 = green, 17 = orange), (b) Compositional di- 
versity (C4) and predicted minimum folding energy (E m ) for known 
ribozymes (length 40-60; see Supplementary Data) (45) are shown as 
blue dots with mean and SD (blue lines), (c) C4 versus E m (black dots) 
predicted by Viennafold (41) for 2.5 x 10 6 RNA sequences of length 50. 
To minimize effects from GC-content, we restricted the in silico 
sampling to sequences whose GC content is 40-60%. To avoid 
sampling artifacts, sequences were assigned to five bins according to 
C4, and an equal number of unique sequences were analyzed in each 
bin. The bin averages are shown as the red line (see Supplementary 
Data for values and SDs). 



correlated with C4, and collectively the Q, explain 61% of 
the variance in E m according to principal components re- 
gression analysis (Supplementary Data). There is a 
notable paucity of energetically stable, low C 4 sequences. 

While stable folding is believed to be a prerequisite for 
function (46), is not a perfect predictor of function. For 
example, sequence libraries containing deliberate repeti- 
tive patterns (alternating purine/pyrimidine) would have 
lower Cjt on an average, but they perform at least as well 
as random libraries during SELEX because the design 
favors hairpin formation (46). Also, some rare structures 
might only be formed if the composition is biased. For 
example, sequences depleted in U and enriched in G are 
more likely to form stable structures, presumably because 
small regions of base-pairing are stabilized by this com- 
position (22). Computational studies suggest that struc- 
tures with long loop regions are favored by enrichment 
for A and C, and the optimal composition depends on 
the desired motif and structure (24). Loop regions in ribo- 
somal RNAs tend to be A-rich, while G, C and U tend to 
comprise the stems (47). RNA folding simulations suggest 
that folding could be improved by a small compositional 
bias (nucleotide frequencies within 2-fold of each other) 
(23). It should be noted that the relationship of sequence, 
structure and function is complex and not yet fully under- 
stood, and well-folded structures are not always func- 
tional; mutations that preserve ribozyme fold might still 
destroy activity. Nevertheless, to the extent that structure 
may be important for function, in general Ck is a measure 
of compositional diversity and structural potential for a 
given sequence. 

Measuring diversity of a pool 

To quantify the diversity among different sequences in a 
pool, we also measure the population-level entropy D of a 
pool of molecules: 

N 

D = -^2 n > lo g2(«;), 

;=1 

where i is an index for unique sequences, N is the total 
number of unique sequences in the pool and «, is the 
fraction of sequences in the pool that consist of copies 
of the zth sequence. D is zero if all molecules are identical 
and D is maximal when molecules are distributed uni- 
formly through sequence space. In contrast to C k (a 
property of each sequence), D is a property of the entire 
pool. 

Simulations of prebiotically plausible reactions 

To understand the effect of prebiotically plausible 
chemical reactions on compositional diversity, we first 
simulated a population of binary sequences undergoing 
three reactions: concatenation, template-directed ligation 
and hydrolysis. During concatenation, the 5'-end of 
one monomer (or polymer) reacts with the 3' -end of 
another monomer (or polymer) with rate constant k con . 
This process first polymerizes monomers into oligo- 
mers, and later joins monomers and oligomers in 
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template-independent reactions. Template-directed 
ligation can occur when two oligomers anneal adjacent 
to one another on a template sequence, leading to the 
ligation of the two oligomers with rate constant /c lig (48). 
Template-directed ligation appears to be a general phe- 
nomenon, occurring with peptides, small molecules and 
nucleic acids (49-51), and it can greatly accelerate bond 
formation (52-54). Interestingly, template-directed 
ligation of oligonucleotides appears to be relatively 
unbiased compared to monomer polymerization, 
permitting the incorporation of nucleotides that are effect- 
ively unreactive as monomers (55). For the purpose of 
modeling, we assume that three or more adjacent 
'Watson-Crick base-pairs' (i.e. O's pairing with l's in 
our two-letter model) are required for annealing 
(34,54,56,57). Finally, hydrolysis of phosphodiester 
bonds, an important process for RNA molecules, occurs 
in our model at a constant rate per bond (7c h ). 



Model parameters 

The parameters of our simulations are the two dimension- 
less ratios: r con = k con c Q /k h , where c 0 is the concentration 
of monomers, which gives the relative strength of concat- 
enation and hydrolysis; and nig = &ii g co/^con, which gives 
the relative strength of template-directed ligation and con- 
catenation. In experiments, c 0 is usually in the millimolar 
range, ?- con is roughly 1-100, and r Vlg is between 10 3 and 10 7 
(see Supplementary Data), so we use these parameters in 
our simulations. For computational tractability, we use a 
two-base system in the modeling for the purpose of 
building intuition. A two-base system has been proposed 
as a progenitor of the four-base system (12,58), and a 
ribozyme can be composed of only two bases (25). 
However, because a four-base system would be more real- 
istic and it has been argued that this alphabet size is 
optimal (59), we also investigated the four-base system 
to the extent that it was computationally tractable. 
Regardless, a four-base system is used in our DNA- 
based experiments testing the predictions of the 
simulations. 

Based on these chemical reactions, we implemented a 
stochastic simulation of a small reactor (e.g. a protocell) 
and a deterministic simulation mimicking a very large 
reactor. The stochastic simulations were initiated with 
~400 monomers (corresponding to a concentration of 
~10mM in a protocell ~50-100nm in diameter). We 
recorded the average Q., D and the length distribution 
after the reactors reached steady state (Supplementary 
Data). While the stochastic results are most relevant to 
prebiotic protocells, we used the deterministic results to 
understand diversity for computational reasons. The de- 
terministic simulations are generalizations of a previously 
described 'prelife' framework (Supplementary Data) 
(60-62). We examined both possible sources of bias: 
(i) biased reactivity and (ii) biased initial monomer abun- 
dance. Based on the reactivity and abundance differences 
of the literature cited earlier, the bias examined in each 
case was roughly one order of magnitude. 



RESULTS 

Simulation: template-directed ligation causes a shift 
toward longer sequences 

In the absence of template-directed ligation, both sources 
of bias resulted in an exponential relationship between 
length and abundance at steady state, with the scaling 
determined by the ratio r con . We give an analytical proof 
of this relationship, which has been seen in other models 
of polymerization (63), in Supplementary Data. While 
increasing concatenation would also create longer 
products due to greater bond formation, even high con- 
catenation rates would still give an exponentially 
decreasing distribution of lengths. In contrast, template- 
directed ligation skewed the distribution qualitatively 
toward longer lengths (Figure 2a and Supplementary 
Data), resulting in a substantial excess of long sequences 
compared to an exponential distribution. The skew may 
occur because this process uses somewhat long substrates 
(greater than or equal to three bases) to make longer 
products in relatively few steps. Since ribozymes and 




10 20 30 40 50 60 70 80 90 100 
Length 



(b) 

0.9 




1 10 1 10 2 

Concatenation ratio (r con ) 

Figure 2. Template-directed ligation increases average length and com- 
positional diversity in silico. (a) Length distribution of binary sequences 
with or without template-directed ligation [/'i ig = 0 (red) or 10 6 (blue); 
'ton = 10 in both cases]. Length is the number of bases per molecule, 
(b) Compositional diversity C 3 at several r con values, with or without 
template-directed ligation, when monomer reactivity is biased [19-fold 
difference between £- coll ; nig = 0 (red) or 10 6 (blue); length = 15]. 
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aptamers typically have a length of 30 bases or greater 
(45), template-directed ligation could improve the chance 
of obtaining functional molecules simply by increasing the 
number of long polymers. For example, as ru g increased 
from 0 to 10 (r con = 10), the mass fraction of 
ribozyme-length sequences (>30 bases) increased from 
0% (numerically undetectable) to >5%. A similar trend 
is seen using a four-base simulation (Supplementary 
Data). 

Simulation: template-directed ligation increases 
compositional diversity 

Template-directed ligation increased the average compos- 
itional diversity C 3 beyond that achieved by concatenation 
alone when analyzing product sequences of the same 
length. When reactivities were biased, C 3 depended on 
the rates of both concatenation and template-directed 
ligation. Without template-directed ligation, a system 
with higher r con had higher average C 3 (Figure 2b). This 
effect appears to be a consequence of mass action, as the 
less reactive monomer is increasingly incorporated into 
polymers when concatenation is fast relative to hydrolysis. 
A simplified analytical model demonstrates this effect 
(Supplementary Data). However, at a given r con , 
template-directed ligation further increased average C 3 
(Figure 2b), particularly at low concatenation rates. A 
four-base system appears to give similar results, with the 
caveat that our analysis was limited by computational 
tractability (Supplementary Data). Since template- 
directed ligation, like concatenation, increased the 
amount of bond formation relative to hydrolysis, one 
possible explanation might again be mass action. 
Therefore, we also measured C 3 as a function of the 
total rate of bond-forming events (concatenation and 
template-directed ligation). At the same rate of bond for- 
mation, template-directed ligation still increased the C 3 of 
the sequences, such that the majority of the increase of C 3 
with r lig was not simply the result of increased bond for- 
mation (Supplementary Data). Another possible mechan- 
ism for this increase is that template-directed ligation is a 
relatively unbiased mode of ligation compared to concat- 
enation. To illustrate this point, we performed simulations 
in which we artificially relaxed the requirement for com- 
plementary base-pairing in template-directed ligation, 
which we call 'relaxed-ligation'. Relaxed-ligation retains 
the effect of introducing unbiased reactions while 
eliminating the more subtle effects stemming from the 
information content of the reacting sequences. Relaxed- 
ligation produces average C 3 that are similar to com- 
parable template-directed ligation simulations 
(Supplementary Data). This result suggests that the 
unbiased nature of template-directed ligation is an import- 
ant explanation for the increase of C 3 with template- 
directed ligation. 

In simulations where monomer abundance was biased, 
concatenation alone resulted in relatively low average C 3 
(~0.4), independent of r con . The effect of template- 
directed ligation was complicated but it tended to 
increase compositional diversity (Supplementary Data). 
This may be due to increased incorporation of the less 



abundant monomer through complementary base 
pairing in addition to the effects detailed below. Overall, 
both sets of simulations suggested that template-directed 
ligation caused a relative increase of compositionally 
diverse sequences. 

Simulation: diversity of the pool 

While Q measures internal heterogeneity in a sequence 
and D measures population-wide diversity, we found 
that these measures were highly correlated in our simula- 
tions across a range of parameters (Supplementary Data). 
Therefore, increased average compositional diversity 
within a sequence implied increased diversity among mol- 
ecules in the pool. 

Experimental: compositional diversity in template-directed 
ligation of DNA 

To experimentally test our main prediction that 
template-directed ligation increases the average compos- 
itional diversity of a heterogeneous pool of sequences con- 
taining all 4 nt, we performed template-directed ligation in 
a pool of sequences made by degenerate DNA oligo- 
nucleotide synthesis with all four bases (A, C, G, T). 
The experiments described here were performed with 
DNA (not RNA). We chose to use DNA because reactiv- 
ity differences during synthesis are well known (64). Slight 
differences in phosphoramidite reactivity result in small 
biases in the composition of a degenerate pool, analogous 
to the larger biases resulting from reactivity differences in 
prebiotic syntheses. Since the reactivity biases in 
phosphoramidite synthesis are relatively small, this experi- 
ment is a stringent test of whether template-directed 
ligation can increase average Q. We studied whether the 
bias would be countered by template-directed ligation. 
Degenerate templates (length 40; four bases) and 
octamer substrates (four bases) were used to perform 
ligation. The size difference between the templates, sub- 
strates and expected ligation products permitted later gel 
purification of the products. Bond formation was 
catalyzed by cyanogen bromide after an annealing step 
(34) and the products of ligation were isolated by gel puri- 
fication. Templates, octamers and products were 
sequenced using the Illumina platform (Supplementary 
Data). We measured the proportion of sequences that 
had C 3 close to that of ribozymes (C 3 of 0.95 or greater) 
and found that the ligation products were significantly 
shifted toward higher C 3 compared to the templates 
(Figure 3a and Supplementary Data). Ligation products 
also had higher C 3 than sequences predicted from random 
concatenation of the sequenced octamers, indicating that 
template-directed ligation could increase average C 3 
beyond unbiased concatenation by favoring composition- 
ally diverse templates. The proportion of ligation products 
of high C 3 was similar to that of a uniform random pool, 
indicating that template-directed ligation quantitatively 
countered the initial bias of synthesis (Figure 3a). The 
shift was not due to artifacts from comparing samples of 
different length, experimental bias during sequencing or a 
shift in GC content (Supplementary Data). We attempted 
to ascertain whether sequence elements from a recently 
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Figure 3. Experimental relationship between compositional diversity and template-directed ligation, (a) Fraction of DNA sequences having high 
compositional diversity (C3>0.95; analysis length = 16) after template-directed ligation in a heterogeneous pool of degenerate oligonucleotides 
including four bases. A greater fraction of reaction products (yellow) have high C 3 relative to the templates (red = average C 3 of 16-mers contained 
in sequenced 40-mer templates) and substrates (orange = C3 of 16-mers from in silico non-templated, random concatenation of experimentally 
sequenced octamers). The fraction of high C3 sequences in a uniform random pool is shown in green. Error bars are SDs from replicate sequencing 
experiments, (b) Polyacrylamide gel showing higher molecular weight products of template-directed ligation for different single templates from a 
binary alphabet. Molecular weight markers are given in the left lane. '0' indicates a reaction without template added. Template C3 increases from left 
(C 3 = 0) to right (C3 = 0.97; see 'Materials and Methods' section for list of sequences). 



described RNA replicase (65) could be found at greater 
frequency in the pool of template-directed ligation 
products compared with random concatenation of the 
octamers; no difference was apparent by this test 
(Supplementary Data), but the importance of this 
finding is tempered by our incomplete understanding of 
the sequence elements supporting ribozyme function. 

The error of our measurements of compositional diver- 
sity in ligation reactants and products could be calculated 
in two ways: (i) SD among experiments, as shown in 
Figure 3a or (ii) SD from bootstrapping subsamples. 
The error (i) reflects the deviation between experiments. 
The error (ii) reflects the sampling error of 
sequencing. The sampling error is similar in magnitude 
to the error between experiments (Supplementary Data). 
Also, a possible source of differences between templates, 
octamers and ligation products is bias introduced by RNA 
ligase during preparation for deep sequencing. This bias 
is most pronounced at the 5'- and 3'-ends of the se- 
quence reads (66). To confirm that the differences in 
measured Q among these samples were not due to arti- 
facts from bias at the ends of the sequences, we 
randomized the first and last base of each sequence read 
(i.e. replaced the 3' and 5' bases with a randomly chosen 
base: A,C,G,T). The Q of this end-randomized set of 
sequences were calculated. We found that end- 
randomization did not affect the conclusion that ligation 
products had significantly higher than the templates 
and octamers (Supplementary Data). 

These results confirmed the prediction that 
template-directed ligation increases the compositional di- 
versity of a heterogeneous pool of DNA sequences 
comprising four bases. One possible mechanism for this 
increase could be that internally diverse sequences were 
better templates. We can calculate the ratio R of the 



probabilities of template-directed ligation occurring on a 
high C k (Phigh) versus low C k (p low ) template: 

^ = ^high = 1 -(1 -pxpjf 
Wow ?ix(l-(l-^ 

where p l is the probability of the template annealing to 
two adjacent fragments and P2 is the probability of bond 
formation. Both p x and p 2 are <1 in practical situations. 
The ratio R is therefore always > 1 (Supplementary Data), 
meaning that high Q sequences are more likely to be tem- 
plates. Therefore, templates with high C* should be more 
likely to propagate their sequence information. 

To test this experimentally, we studied the dependence 
of ligation efficiency on C 3 for different templates. A set of 
binary DNA templates (two bases: C,T; 32-mer) of 
varying C 3 was designed such that any 8-mer subsequence 
within the templates was 1 of 25 known octamer sequences 
(two bases: A,G), allowing the use of a defined set of sub- 
strates. Although this base composition is not a good 
mimic of a prebiotic reaction, it was chosen to minimize 
intramolecular secondary structure in order to focus on 
the effect of sequence heterogeneity. The templates were 
mixed with an excess of radiolabeled 5'-phosphorylated 
binary DNA octamers (A,G; Supplementary Data). All 
sequences had a GC content of 47-50% except for the 
template with C 3 = 0. Ligation reactions were analyzed 
by polyacrylamide gel electrophoresis to visualize higher 
molecular weight products. We found a positive relation- 
ship between C 3 and amount of products formed 
(Figure 3b). A similar trend was observed using random 
degenerate octamers and longer templates (Supplementary 
Data). This suggests that ligation efficiency on heteroge- 
neous templates is one mechanism by which Q increases 
in the products of template-directed ligation. 
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Figure 4. Proposed prebiotic scenario. Monomers first concatenate into compositionally biased short oligomers. When the oligomers are long 
enough to act as templates, template-directed ligation produces relatively long, compositionally diverse sequences. These sequences can fold into 
stable structures, some of which may be catalytically active, leading to the RNA world. 



DISCUSSION 

Our simulations and experiments suggest that compos- 
itional diversity in a pool of nucleic acids could have 
emerged early on, despite biases in monomer abundance 
and reactivity. The increase due to template-directed 
ligation may have at least two causes. First, 
template-directed ligation is relatively unbiased 
compared to concatenation, so it would counter the intrin- 
sic bias of the system. Second, ligation may happen more 
efficiently on a compositionally diverse template, because 
internal sequence correlations in a low Q template reduce 
the number of independent possibilities for ligation. 
Essentially, a compositionally diverse template could 
utilize a greater fraction of the substrate pool while the 
subsequences within a repetitive template would compete 
with each other for substrates. 

While it may be possible to imagine scenarios where 
different biases cancel one another to produce complex 
pools, such finely tuned rates are unlikely in real 
systems. Based on our results using two-letter simulations 
and template-dependent DNA ligation, we can suggest the 
following prebiotic scenario (Figure 4). Initially the 
sequence pool would consist of short, highly composition- 
ally biased sequences resulting from differences in reactiv- 
ity among the nucleotides. However, once these sequences 
become long enough to serve as templates (length > 6), the 
general mechanism of template-directed ligation would 
favor propagation of internally diverse sequences. A 
small increase in diversity would correspond to an expo- 
nentially large increase in the fraction of sequence space 
that would be explored. One should note that the optimal 
compositional diversity for forming secondary structures 
may be less than the maximum possible. For example, 
compositional bias toward GC-rich sequences may be a 
reasonable criterion for identifying non-coding RNAs in 
the genome due to the effect on folding stability (67), 
although formal structure itself does not appear to be a 
good criterion (perhaps because the compositional diver- 
sity of a genome tends to be fairly high, giving a relatively 
large probability of forming structures). Nevertheless, it is 
unclear whether the bias from prebiotic processes would 
be in the correct direction to favor structures, and it would 
still be desirable to increase diversity over the substantial 
~ 10-fold initial compositional bias estimated for prebiotic 
reactions. As the compositional bias disappeared, 
well-folded sequences could emerge. In addition, 
template-directed ligation would rapidly stitch together 
short sequences to produce a qualitative shift toward 



long sequences. The combined effects on diversity and 
length would enable the generation of ribozymes. These 
first inefficient ribozymes could then 'jump-start' the evo- 
lution of sequences with greater function and complexity 
through natural selection, especially within a spatially 
restricted context (44,63,68). 

Although it has been previously hypothesized that 
abstract measures of primary sequence information are 
unrelated to RNA function (11,69), we found a correl- 
ation between folding energy and Q in silico. This 
suggests that internal heterogeneity, which is calculable 
from primary sequence alone, is an interesting measure- 
ment in addition to functional information or genomic 
complexity (which require knowledge of functional 
activity or fitness, respectively) (36,69). An important 
caveat regarding this relationship is that the minimum 
free energy of a sequence is only one of several features 
that would be important for functional activity. Other 
features would also be desirable [e.g. a large energy gap 
between the most stable fold and misfolded structures, or 
low structural 'plasticity' (70)]. Many desirable features 
are poorly understood. Interestingly, our simulations 
results suggest that, although template-independent 
processes result in exponential length distributions, 
template-directed ligation would skew the distribution 
qualitatively toward long sequences. A minimum length 
appears to be required to find certain activities. This was 
demonstrated by a series of selections for isoleucine 
aptamers that differed only by the length of the random 
region; no aptamers were isolated at the shortest length 
(16 bases) (32). Therefore, template-directed ligation may 
increase the probability of finding functional molecules by 
increasing the frequency of long sequences. Our results 
also highlight the importance of templating as a special 
property of nucleic acids during the origin of life: in 
addition to enabling the faithful replication of informa- 
tion, the ability to template could have promoted a search 
of functionally rich regions of sequence space. 

Our modeling, while adequate for generating a hypoth- 
esis that could be experimentally tested, could be made 
more realistic in several ways. The alphabet size could 
be increased to include four bases, although this modifi- 
cation is computationally expensive because longer se- 
quences would be needed to differentiate among the 
more varied compositions. Secondary structure could be 
included, which might interfere with templating and 
protect against degradation (71). The fact that our DNA 
experiments (including a four-letter alphabet and the 
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possibility of secondary structure) show that template- 
directed ligation increases compositional diversity 
suggests that the overall result is not greatly influenced 
by the simplifications in the modeling. The rate of 
non-templated concatenation might depend on the 
length of the reactants, and circularization reactions 
might select against certain lengths. Interestingly, models 
of prebiotic polymerization of monomers that neglect the 
concatenation of oligomers (i.e. strong dependence of rate 
on substrate length) also yield exponential length distribu- 
tions at equilibrium (61-63,72), suggesting that the quali- 
tative distribution is not changed by the inclusion of 
oligomer concatenation. More experimental investiga- 
tion would be required to understand the possible 
dependencies. Advances in the modeling based on such 
realistic features would be worthwhile for a more 
detailed prediction of the outcome of prebiotically plaus- 
ible reactions. Our experiments used DNA because biases 
during synthesis are well known and we expect DNA to 
resemble RNA with respect to annealing of oligonucleo- 
tides. While the details of formation of secondary struc- 
ture differ for DNA and RNA, both types of molecule can 
fold into catalytically active structures (73). Nevertheless, 
an investigation of compositional diversity using RNA 
would be more realistic as a model of the RNA world. 

It is generally assumed that the emergence of ribozymes 
was the result of natural selection for replication in a pool 
of RNA sequences (3,44), but the pathway for generating 
the very first ribozymes is unknown. The probability of 
finding functional sequences also depends on the desired 
activity. Functions that can be performed by short motifs, 
such as aminoacylation or self-cleavage (24,31,74), could 
potentially be found even in highly biased sequence pools 
since the probability of finding the motif is relatively large. 
However, more sophisticated functions that appear to 
require longer motifs, such as an RNA polymerase 
(27,65), would be more likely to arise in a sequence pool 
if the prebiotic compositional bias were at least somewhat 
mitigated. It has been suggested that physical ordering 
effects alone could not produce functional molecules 
(11). Here, we have shown how long, compositionally 
diverse and well-folded sequences might be produced as 
a consequence of prebiotically plausible chemical 
mechanisms. 



SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Figures 1-24 and Supplementary 
References [75-86]. 
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